Add UN M.49 region containment support #193

Merged
ptr727 merged 3 commits into develop from feature/unm49-containment-175
Jun 26, 2026

ptr727 (Owner) commented Jun 26, 2026

Resolves #175.

What

Matching a UN M.49 region group against a contained region now works, e.g. es-419 (Latin America and the Caribbean) matches es-MX (Mexico).

Changes

  • New UnM49Data dataset sourced from Unicode CLDR territoryContainment, parsed with XmlReader (reflection-free, keeps the library AOT compatible — XmlSerializer is avoided), then JSON converted and code generated following the existing ISO/RFC dataset pattern.
  • LanguageLookup.IsMatch(prefix, tag, regionContainment) — opt-in overload. The existing two-argument IsMatch is unchanged and delegates with regionContainment: false, so there is no behaviour change and no extra work in the default path. Matching is directional: the broad group in the prefix matches the specific region in the tag, not the reverse.
  • LanguageLookup.ExpandRegion: expands a tag region into the tag plus a variant per containing UN M.49 group, e.g. es-MX → es-013, es-419, es-019, es-001.
  • Parser fix: ValidateExtendedLanguage now requires 3 alpha per RFC 5646 (extlang = 3ALPHA), so a numeric region following the language parses as a region instead of an extended language, e.g. es-419.
  • Restored previously trimmed control-flow comments in the parser and lookup.
  • Docs: README usage example + references, HISTORY entry, version.json bumped to 1.4.
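
The expansion described in the list above can be sketched as a walk up the containment graph. This is a minimal, self-contained illustration and not the library's code: the class name, the two-argument `ExpandRegion(language, region)` signature, and the four-entry containment table are assumptions made for the example, and the real dataset is far larger.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the library's UN M.49 data and ExpandRegion.
public static class RegionExpansionSketch
{
    // Illustrative subset only: containing group -> directly contained codes.
    private static readonly (string Parent, string[] Children)[] Containment =
    {
        ("419", new[] { "013" }),
        ("019", new[] { "013" }),
        ("001", new[] { "019" }),
        ("013", new[] { "MX" }),
    };

    // Breadth-first walk from a region through every group that contains it.
    public static List<string> GetAncestors(string region)
    {
        var ancestors = new List<string>();
        var queue = new Queue<string>();
        queue.Enqueue(region);
        while (queue.Count > 0)
        {
            string current = queue.Dequeue();
            foreach (var (parent, children) in Containment)
            {
                if (Array.IndexOf(children, current) >= 0 && !ancestors.Contains(parent))
                {
                    ancestors.Add(parent);
                    queue.Enqueue(parent);
                }
            }
        }
        return ancestors;
    }

    // Directional check: the group contains the region, never the reverse.
    public static bool Contains(string group, string region) =>
        GetAncestors(region).Contains(group);

    // The original tag first, then one tag per containing group.
    public static List<string> ExpandRegion(string language, string region)
    {
        var tags = new List<string> { $"{language}-{region}" };
        foreach (string ancestor in GetAncestors(region))
        {
            tags.Add($"{language}-{ancestor}");
        }
        return tags;
    }
}
```

With this table, `ExpandRegion("es", "MX")` yields es-MX followed by es-013, es-419, es-019 and es-001. The order of the group entries falls out of the table order in this sketch and says nothing about the ordering the library guarantees.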

Design notes

  • 419 is a grouping="true" overlay in CLDR (not a canonical tree node), so the loader keeps grouping overlays and skips only status="deprecated" entries — dropping overlays would remove 419 entirely.
  • 001 (World) is the universal ancestor, so es-001 matches any es tag with a region; documented in the API and covered by a test.
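
The loader rule in the first note (keep grouping overlays, skip deprecated entries) might look roughly like the following with XmlReader. This is a sketch under assumptions: the `group` element and its `type` and `contains` attributes are taken to be the shape of the CLDR territoryContainment data, the XML fragment is hand-written for the example rather than copied from CLDR, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ContainmentLoaderSketch
{
    // Hand-written fragment for illustration; not the real CLDR data.
    public const string SampleXml =
        "<supplementalData><territoryContainment>" +
        "<group type='001' contains='019 002'/>" +
        "<group type='019' contains='021 013 029 005'/>" +
        "<group type='419' contains='013 029 005' grouping='true'/>" +
        "<group type='001' contains='QU' status='deprecated'/>" +
        "</territoryContainment></supplementalData>";

    public static List<(string Code, string[] Contains)> Load(TextReader input)
    {
        // Ignore a DOCTYPE declaration if the input has one, instead of throwing.
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore };
        var records = new List<(string Code, string[] Contains)>();
        using var reader = XmlReader.Create(input, settings);
        while (reader.Read())
        {
            if (reader.NodeType != XmlNodeType.Element || reader.Name != "group")
            {
                continue;
            }

            // Deprecated entries are skipped; grouping overlays such as 419 are kept.
            if (reader.GetAttribute("status") == "deprecated")
            {
                continue;
            }

            var code = reader.GetAttribute("type");
            var contains = reader.GetAttribute("contains");
            if (code is null || contains is null)
            {
                continue;
            }

            records.Add((code, contains.Split(' ', StringSplitOptions.RemoveEmptyEntries)));
        }
        return records;
    }
}
```

Calling `Load(new StringReader(ContainmentLoaderSketch.SampleXml))` returns three records (001, 019, 419) and drops the deprecated one. Reading attributes by hand like this uses no reflection, which is the property the PR relies on for AOT compatibility.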

Testing

  • New UnM49Tests (dataset round-trip, Find, Contains, GetAncestors) and LanguageLookupTests theories covering the containment matching and ExpandRegion, including the directional and backward-compatibility cases.
  • Full suite: 293/293 passing. Build clean with AOT reference verification enabled.

🤖 Generated with Claude Code

Resolve #175 - match a UN M.49 region group against a contained region,
e.g. es-419 (Latin America) matches es-MX (Mexico).

- Add UnM49Data sourced from the CLDR territoryContainment data, parsed
  with XmlReader to keep the library AOT compatible, generated and
  embedded following the existing dataset pattern.
- Add LanguageLookup.IsMatch(prefix, tag, regionContainment) as an opt-in
  overload; the existing two argument IsMatch is unchanged.
- Add LanguageLookup.ExpandRegion to expand a region into its containing
  UN M.49 groups.
- Fix ValidateExtendedLanguage to require 3 alpha so a numeric region
  following the language parses as a region, e.g. es-419.
- Restore previously trimmed control flow comments in the parser and lookup.
- Update README and HISTORY, add UN M.49 references, bump version to 1.4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 26, 2026 16:11

Copilot AI left a comment

Pull request overview

Adds UN M.49 (CLDR territoryContainment) region containment support to the LanguageTags library so numeric region groups (e.g., 419) can be matched against contained regions (e.g., MX), and enables region expansion to ancestor group regions.

Changes:

  • Introduces UnM49Data (XML loader + JSON/codegen + embedded generated dataset) sourced from Unicode CLDR territory containment.
  • Adds LanguageLookup.IsMatch(prefix, tag, regionContainment) overload and LanguageLookup.ExpandRegion() for containment-based matching/expansion.
  • Fixes parsing so extlang requires 3ALPHA, ensuring numeric regions like es-419 parse as a region.
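
To make the parsing fix concrete: RFC 5646 defines `extlang = 3ALPHA *2("-" 3ALPHA)` and `region = 2ALPHA / 3DIGIT`, so a three-character subtag can only be an extended language if every character is a letter. The sketch below checks subtag shape only. The class and method names are hypothetical and this is not the library's ValidateExtendedLanguage.

```csharp
using System;
using System.Linq;

public static class SubtagShapeSketch
{
    private static bool IsAsciiLetter(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');

    private static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';

    // extlang = 3ALPHA: exactly three ASCII letters.
    public static bool IsExtendedLanguageShape(string subtag) =>
        subtag.Length == 3 && subtag.All(IsAsciiLetter);

    // region = 2ALPHA / 3DIGIT: two ASCII letters or three ASCII digits.
    public static bool IsRegionShape(string subtag) =>
        (subtag.Length == 2 && subtag.All(IsAsciiLetter))
        || (subtag.Length == 3 && subtag.All(IsAsciiDigit));
}
```

Under these checks 419 has the region shape and not the extended language shape, while a subtag such as yue is the other way round, which is why es-419 can only parse as language plus region.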

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| version.json | Bumps library version to 1.4. |
| README.md | Documents UN M.49 containment matching and region expansion; updates links and release-notes snippet. |
| HISTORY.md | Adds 1.4 release entry describing containment support and parser fix. |
| LanguageTagsCreate/CreateTagData.cs | Extends codegen pipeline to download/convert/generate UN M.49 data and code. |
| LanguageTags/UnM49Data.cs | Implements UN M.49 loader (XmlReader), JSON serialization, code generation, and query APIs (Find/Contains/GetAncestors). |
| LanguageTags/UnM49DataGen.cs | Adds generated embedded UN M.49 dataset used by UnM49Data.Create(). |
| LanguageData/unm49.json | Adds generated JSON form of the UN M.49 dataset. |
| LanguageTags/LanguageSchema.cs | Registers UnM49Data in the source-generated JsonSerializerContext. |
| LanguageTags/LanguageLookup.cs | Adds containment-aware matching overload and ExpandRegion(); wires in UN M.49 dataset usage. |
| LanguageTags/LanguageTagParser.cs | Tightens extlang validation to 3ALPHA so numeric regions don’t misparse as extlang; restores control-flow comments. |
| LanguageTagsTests/UnM49Tests.cs | Adds tests for UN M.49 dataset loading/round-tripping and basic query behavior. |
| LanguageTagsTests/LanguageLookupTests.cs | Adds tests for containment matching (directional + opt-in) and ExpandRegion(). |

Comment thread LanguageTags/LanguageLookup.cs
Comment thread LanguageTags/LanguageLookup.cs Outdated
Comment thread LanguageTags/UnM49Data.cs
Comment thread LanguageTags/UnM49Data.cs
- Region containment matching now substitutes the candidate region and
  reuses the plain matcher, preserving variant, extension, and private use
  semantics to avoid false positives, e.g. es-419-nedis no longer matches
  es-MX while es-419 still matches es-MX-nedis.
- Clarify UnM49Data.Find returns the first of possibly multiple records for
  a code, and point to GetAncestors/Contains for full transitive containment.
- Clarify UnM49Record.Code may be an alphabetic CLDR grouping code (EU, EZ, UN).
- Add region containment tests for script preservation and prefix variants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
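
The substitution approach in the commit above can be sketched as follows. This assumes that "substitutes the candidate region" means replacing the tag's region with each containing group and re-running the plain prefix matcher. The class, the simplified prefix matcher, the region finder, and the one-entry ancestor table are all hypothetical and stand in for the library's own code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ContainmentMatchSketch
{
    // Illustrative subset: region -> containing groups.
    private static readonly Dictionary<string, string[]> Ancestors =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            ["MX"] = new[] { "013", "419", "019", "001" },
        };

    // Simplified plain matcher: whole tag, or prefix ending on a subtag boundary.
    public static bool PlainMatch(string prefix, string tag) =>
        tag.Equals(prefix, StringComparison.OrdinalIgnoreCase)
        || tag.StartsWith(prefix + "-", StringComparison.OrdinalIgnoreCase);

    public static bool IsMatch(string prefix, string tag, bool regionContainment)
    {
        if (PlainMatch(prefix, tag)) return true;
        if (!regionContainment) return false;

        string[] subtags = tag.Split('-');
        int index = FindRegionIndex(subtags);
        if (index < 0 || !Ancestors.TryGetValue(subtags[index], out var groups))
        {
            return false;
        }

        // Swap only the region; variants, extensions and private use stay in place.
        foreach (string candidateRegion in groups)
        {
            var candidate = (string[])subtags.Clone();
            candidate[index] = candidateRegion;
            if (PlainMatch(prefix, string.Join("-", candidate))) return true;
        }
        return false;
    }

    // First 2ALPHA or 3DIGIT subtag after the language, before any singleton.
    private static int FindRegionIndex(string[] subtags)
    {
        for (int i = 1; i < subtags.Length; i++)
        {
            string s = subtags[i];
            if (s.Length == 1) break;
            if ((s.Length == 2 && s.All(IsAsciiLetter)) || (s.Length == 3 && s.All(IsAsciiDigit)))
            {
                return i;
            }
        }
        return -1;
    }

    private static bool IsAsciiLetter(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');

    private static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';
}
```

Because the rest of the tag is left untouched, the prefix es-419 matches es-MX-nedis (the candidate es-419-nedis starts with es-419-), while the prefix es-419-nedis does not match es-MX (no candidate carries the nedis variant). With `regionContainment` set to false the function falls back to the plain matcher alone.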

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.

Comment thread LanguageTags/LanguageLookup.cs
Comment thread README.md Outdated
"variant" has a specific RFC 5646 meaning, so describe the expanded
entries as region substituted tags instead, in the XML doc and README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 13 changed files in this pull request and generated no new comments.

@ptr727 ptr727 merged commit 10cc912 into develop Jun 26, 2026
12 checks passed
@ptr727 ptr727 deleted the feature/unm49-containment-175 branch June 26, 2026 16:53
ptr727 added a commit that referenced this pull request Jun 26, 2026
Release: promote develop to main

- UN M.49 region containment (#193, version floor 1.4)
- Release-workflow fixes re-synced from template (#196: #213/#214/#217)

Conflicts in three workflow files were the parallel setup-dotnet dependabot
bump (identical on both branches); resolved to develop's authoritative
versions (newer template re-sync + the workflow fixes).